feat(node): peer partial-mirrors for repos with private subtrees#35
feat(node): peer partial-mirrors for repos with private subtrees#35beardthelion wants to merge 13 commits into
Conversation
Three fixes from the PR Gitlawb#33 review: - withheld_paths now applies the whole-repo "/" read gate (returns repo-not-found when the caller cannot read the root), matching the git read endpoints. Without it the endpoint disclosed a private repo's existence and path layout to unauthorized callers. The withheld_globs doc already assumed this gate existed; now it does. - A nested allow under a denied parent (e.g. "/secret/public/**" allowed, "/secret/**" denied) was over-withheld: the client sparse-excluded the whole parent and hid paths the caller may read. The endpoint now also returns a "reinclude" list (allowed globs strictly under a denied one) and gl clone re-includes them in the sparse spec after the excludes. - Wildcard-free globs like "/docs/private" match both the exact path and a subtree (per glob_matches), but the client only emitted the subtree exclude. sparse_patterns now emits both "/docs/private" and "/docs/private/". Verified the exclude-then-reinclude sparse ordering checks out cleanly with real git, plus unit tests for reincluded_globs, the nested re-include, the exact-path exclude, and sparse_patterns.
…eld-path errors
split_once('/') accepted owner/name/extra, smuggling a path segment into
the repo name that then flowed into the API path and clone URL; reject it.
fetch_withheld swallowed every network/auth/5xx/JSON error into an empty
result, dropping to a stock clone that the node refuses once blobs are
withheld. Now only 404/501 (endpoint unsupported) fall back to empty; the
rest propagate so the real cause surfaces.
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (2)
📝 WalkthroughWalkthroughImplements end-to-end partial clone support: visibility helpers compute withheld/reinclude sparse-checkout globs, a new API exposes them, the sync worker classifies promisor mirrors and applies git partial-clone config, and a new ChangesPartial Clone and Sparse Checkout Support
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/gitlawb-node/src/sync.rs`:
- Around line 66-72: The shared reqwest client is created with
reqwest::Client::new() and can hang the sync worker; create the Client with a
default timeout (e.g. via Client::builder().timeout(...).build()) and also wrap
the call in fetch_withheld() with tokio::time::timeout(...) to bound individual
/withheld-paths requests; additionally fix Promisor→Plain rehydration in
classify_mirror()/fetch_repo(): when classify_mirror() yields MirrorMode::Plain
ensure fetch_repo clears remote.origin.promisor and
remote.origin.partialclonefilter (remove or set to None/empty) so an existing
promisor mirror is converted to a full/plain mirror, and add unit/integration
tests that transition a repo from Promisor to Plain to assert those fields are
cleared and rehydration completes.
- Around line 235-257: fetch_repo currently only sets promisor config when mode
== MirrorMode::Promisor but never clears it when classify_mirror determines the
repo moved from Promisor → Plain; update fetch_repo to detect that transition
(use the computed mode from classify_mirror) and when switching to Plain
explicitly unset remote.origin.promisor and remove
remote.origin.partialclonefilter (run git config --unset or set to false/empty
for remote.origin.promisor and git config --unset for
remote.origin.partialclonefilter), then ensure the mirror is rehydrated (e.g.,
trigger a reclone or run the rehydrate logic you use elsewhere) so that stored
promisor state does not leave the mirror incomplete; reference functions/keys:
fetch_repo, classify_mirror, and git config keys remote.origin.promisor and
remote.origin.partialclonefilter.
In `@crates/gitlawb-node/src/visibility.rs`:
- Around line 122-169: reincluded_globs currently collects two unordered buckets
(denied then allowed) which loses nesting order; change its contract to produce
an ordered sequence that preserves rule specificity (e.g., return a Vec<(String,
bool)> or Vec<PatternOp> where bool=true means include and false means exclude)
instead of only allowed globs, implement the logic by iterating over the
original rules in order and for each non-root rule computing Decision via
visibility_check(glob_prefix(&r.path_glob)) and emitting the corresponding
(r.path_glob.clone(), Decision::Allow) or (r.path_glob.clone(), Decision::Deny)
entry so the final sparse-checkout operations maintain the same ordering as the
rules, and add a regression test that applies the three-level scenario
("/secret/**" deny, "/secret/public/**" allow, "/secret/public/admin/**" deny)
to assert the emitted sequence/materialized patterns do not re-include the
deepest denied path.
In `@crates/gl/src/clone.rs`:
- Around line 133-145: In the branch==None clone path, stop calling and parsing
`git remote show origin`; instead read the local ref file
`refs/remotes/origin/HEAD` under the cloned repo (use `dest`), resolve it to the
target branch name and pass that to `git(dest, &["checkout", "-q", &head])` so
default-branch detection is local and robust (replace the Command::new("git")...
block). In `fetch_withheld` (the JSON parsing path that currently uses
`unwrap_or_default()` on `withheld`/`reinclude`), validate that
`Value::get("withheld")` and `Value::get("reinclude")` exist and are arrays (or
return a clear error) and parse their string entries explicitly instead of
defaulting to empty arrays; fail-fast on schema/type mismatches to preserve the
contract described in the doc comment.
- Around line 192-206: fetch_withheld currently tolerates missing/wrong-typed
"withheld" or "reinclude" by falling back to empty Vec, which can silently force
a full clone; update fetch_withheld to validate the response schema and fail on
mismatch: instead of the permissive globs closure, extract both fields from the
parsed JSON and attempt to deserialize each into Vec<String> (e.g. via
serde_json::from_value or .as_array() plus string-checks) and return an error
(with context) if a field is missing or not an array of strings; keep references
to the same symbols (fetch_withheld, the "withheld"/"reinclude" fields and the
behavior consumed by setup_partial_clone) so callers get an Err on schema
mismatch rather than treated as empty lists.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: d37c05af-5217-4263-a8a3-93d66a13fe1b
📒 Files selected for processing (6)
crates/gitlawb-node/src/api/visibility.rscrates/gitlawb-node/src/server.rscrates/gitlawb-node/src/sync.rscrates/gitlawb-node/src/visibility.rscrates/gl/src/clone.rscrates/gl/src/main.rs
Add a regression test for deny /secret, allow /secret/public, deny /secret/public/admin and clarify the sparse-checkout comment. git does not re-traverse an explicitly excluded directory, so emitting all excludes before re-includes keeps the deepest deny in force; the broader parent re-include does not resurrect it.
…hema Read the default branch from the local origin/HEAD symref clone already set, instead of parsing the localized, network-dependent output of git remote show origin. Deserialize the withheld-paths body into a typed struct so a missing or mistyped withheld/reinclude field is a hard error rather than silently becoming an empty list, which would mask a server regression behind a later clone failure.
Give the sync worker's HTTP client a 30s timeout so a stalled peer withheld-paths lookup cannot hang the worker loop. When a repo that was mirrored as a promisor (mode B) becomes fully public, fetch_repo now clears remote.origin.promisor and partialclonefilter and refetches, so the once-withheld blobs are backfilled instead of the mirror staying permanently partial.
fc05ae0 to
25390bf
Compare
Restores replication redundancy for repos that have a private subtree (mode B). After Phase 2 made the sync worker fail closed on any repo with withheld content, those repos only ever lived on their origin node. This lets a peer mirror the public part of such a repo without ever receiving the private blobs.
The sync worker now asks the origin (anonymously) which paths are withheld via the existing
withheld-pathsendpoint, then mirrors accordingly:--filter=blob:limit=10g), which keeps every public object and tolerates the blobs the origin omits for an anonymous callergit clone --mirror/ full fetch, exactly as beforeThe classify step is just an optimization for picking promisor mode. The git layer is the real enforcement: a plain mirror of a mode-B repo rejects the origin's intentionally-incomplete pack, so even if the lookup fails transiently the worst case is a failed sync that retries, never a leak or a corrupt mirror. The promisor mirror physically lacks the private blobs, so re-serving it can't leak them even though the mirror node doesn't hold the origin's visibility rules.
Fetches now go through the configured
originremote (refreshing its URL first) so stored promisor settings are honored on update.Depends on the
withheld-pathsendpoint from #33, so this is branched off that work; the diff will show #33's files until it merges, after which I'll rebase onto main. Independent of #28/#34.Test plan
classify_mirrorover 200+globs, 200+empty, and 404/error.file://bare origin): promisor clone ends uppromisor=true,mirror=true, and retains all objects; plain clone is not a promisor; promisor fetch pulls new commits into an existing mirror.Summary by CodeRabbit